A Lower Bound for Estimating High Moments of a Data Stream 



Sumit Ganguly 
Indian Institute of Technology, Kanpur 


Abstract

We show an improved lower bound for the $F_p$ estimation problem in a data stream setting for $p > 2$. A data stream is a sequence of items from the domain $[n]$ with possible repetitions. The frequency vector $x$ is an $n$-dimensional non-negative integer vector such that $x(i)$ is the number of occurrences of $i$ in the sequence. Given an accuracy parameter $\Omega(n^{-1/p}) < \epsilon < 1$, the problem of estimating the $p$th moment of frequency is to estimate $F_p = \sum_{i \in [n]} |x(i)|^p$ correctly to within a relative accuracy of $1 \pm \epsilon$ with high constant probability in an online fashion and using as little space as possible. The current lower bound for space for this problem is $\Omega(n^{1-2/p}\epsilon^{-2/p} + n^{1-2/p}\epsilon^{-4/p}/\log^{O(1)}(n) + (\epsilon^{-2} + \log(n)))$. The first term in the lower bound expression was proved in a series of works culminating in [3], the second in [6] and the third in [5]. In this note, we show an $\Omega(p^2 n^{1-2/p}\epsilon^{-2}/\log(n))$ bits space bound, for $\Omega(p n^{-1/p}) < \epsilon < 1/10$.

1 Introduction

In the insert-only data streaming model, a stream is modeled as a sequence of items $i_1, i_2, \ldots$, where the items come from a large domain $[n] = \{1, 2, \ldots, n\}$. The frequency vector is an $n$-dimensional vector $x$ whose $i$th coordinate $x(i)$ counts the number of occurrences of $i$ in the sequence. Each new arrival of an item $i_j$ increments $x(i_j)$ to $x(i_j) + 1$. Define $\|x\|_p^p = \sum_{i \in [n]} |x(i)|^p$. The $p$th moment estimation problem, with accuracy parameter $\epsilon$, is to design a structure that can process the stream sequence in an online fashion and return a real value $\hat{F}_p$ satisfying $|\hat{F}_p - \|x\|_p^p| \le \epsilon \|x\|_p^p$ with probability $9/10$. The estimate $\hat{F}_p$ may use only the structure and not the original stream, that is, a stream may be processed in an online fashion only. The $F_p$ estimation problem has played a pivotal role in the study of data streaming algorithms. It was first posed and studied by Alon, Matias and Szegedy [1]. They showed that for all $p \neq 1$, a deterministic $\epsilon$-accurate $F_p$ estimation with $\epsilon < 1/8$ requires $\Omega(n)$ bits, as does a randomized algorithm with no error. This reduces the scope to approximate randomized algorithms or randomized PTAS. A series of works [1, 2, 3] culminated in showing a lower bound of $\Omega(n^{1-2/p}\epsilon^{-2/p})$ bits for $\epsilon$-accurate $F_p$ estimation. Very recently, Woodruff and Zhang [6] improved this bound to $\tilde{\Omega}(n^{1-2/p}\epsilon^{-4/p})$ bits, where $\tilde{\Omega}(f(n,\epsilon))$ denotes $\Omega(f(n,\epsilon)/\log^{O(1)}(n/\epsilon))$. Woodruff [5] shows an $\Omega(\epsilon^{-2} + \log(n))$ bits bound for $F_p$, for all $p \neq 1$.$^1$ So, the current lower bound for $F_p$ estimation in bits is:

$$
\Omega\left( n^{1-2/p}\epsilon^{-2/p} + \frac{n^{1-2/p}\epsilon^{-4/p}}{\log^{O(1)}(n)} + \epsilon^{-2} + \log(n) \right)
$$

In this note, we show a lower bound of $\Omega(p^2 n^{1-2/p}\epsilon^{-2}/\log(n))$ bits for this problem, improving upon the currently known bounds.
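As a point of reference for the definitions above, the following minimal sketch computes the frequency vector and the exact moment $F_p$ of a short stream, and checks the accuracy condition $|\hat{F}_p - \|x\|_p^p| \le \epsilon\|x\|_p^p$. It is only a linear-space baseline for illustration, not a streaming algorithm; the function names are hypothetical.

```python
from collections import Counter

def frequency_vector(stream):
    """Return the frequency vector x as a Counter: x[i] = occurrences of i."""
    return Counter(stream)

def exact_moment(stream, p):
    """Exact F_p = sum_i x(i)^p, using linear space (a baseline only)."""
    x = frequency_vector(stream)
    return sum(count ** p for count in x.values())

def is_eps_accurate(estimate, truth, eps):
    """Check the accuracy requirement |estimate - truth| <= eps * truth."""
    return abs(estimate - truth) <= eps * truth

# Example stream over the domain [3]: x(1) = 2, x(2) = 1, x(3) = 3.
stream = [3, 1, 3, 2, 3, 1]
f3 = exact_moment(stream, 3)  # 2^3 + 1^3 + 3^3
```

Any small-space estimator for $F_p$ can be compared against `exact_moment` on small inputs with `is_eps_accurate`.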



1 Jayram and Woodruff show an $\Omega(\epsilon^{-2}\log(n))$ bits bound when deletions are also allowed, for all $p > 0$.

2 Lower Bound



We will reduce the standard $t$-party set disjointness problem to $F_p$ estimation. The problem $t$-DISJ is as follows: the instance is a collection of $t$ sets $S_1, \ldots, S_t$, each a subset of $[n]$, where the set $S_i$ is given to the $i$th party, with the promise that the set family is either pairwise disjoint, or $S_1 \cap \ldots \cap S_t$ consists of exactly one common element. We denote the $i$th coordinate of a vector $x$ by $x(i)$, so $x = [x(1), \ldots, x(n)]$. With this notation, an instance of $t$-DISJ consists of $n$-dimensional binary vectors $x_1, \ldots, x_t$, where $x_r$ is given to the $r$th party and is interpreted as the characteristic vector of the set $S_r$. The promise is that either (a) $x_1 + \ldots + x_t$ is a binary vector (the disjoint case), or (b) there is exactly one index $i$ such that $x_1(i) = x_2(i) = \cdots = x_t(i) = 1$ (the common element case; as in Figure 1, the remaining coordinates of the sum are 0 or 1). It is well known that any one-way randomized communication protocol that solves $t$-DISJ with probability at least $7/8$ requires $\Omega(n/t)$ bits [2, 3]. We show the following theorem.

Theorem 1 For $2 < p < n^{1/p}/2$ and $\max(80p/n^{1/p}, 3/\sqrt{n}) < \epsilon < 1/4$, an algorithm that estimates $F_p$ with relative error of $\epsilon/10$ and with probability $19/20$ uses space $\Omega\left(\frac{p^2 n^{1-2/p}}{\epsilon^2 \log(n)}\right)$ bits.

Proof We present a randomized one-way communication protocol for $t$-DISJ that is correct with probability $9/10$, where $t = \lceil \epsilon n^{1/p}/(2p) \rceil$. The protocol uses two structures that can process stream updates, one for estimating $F_p$ to within a factor of $1 \pm \epsilon/10$ with confidence $1 - 1/(20n)$, and the second for estimating $F_0$ to within a factor of $1 \pm \epsilon/10$ with probability $19/20$.

A one-way protocol for $t$-DISJ is as follows. Consider an instance of $t$-DISJ. Party 1 inserts $x_1$ into each of the structures for estimating $F_p$ and $F_0$ and sends the pair of structures to the second party. This party adds its vector $x_2$ to the two structures received and relays them to the third party, and so on, in sequence. Finally, the $t$th party inserts its own vector into the structures obtained from the $(t-1)$st party. It then uses the procedure InferDisj of Figure 1 to infer whether the instance is pairwise disjoint or has a common element.
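The message flow of the one-way protocol can be illustrated with a small simulation. This is a sketch under a strong simplifying assumption: an exact `Counter` of frequencies stands in for the pair of $F_p$ and $F_0$ structures, whereas the actual protocol would forward small-space sketches. The helper names `run_one_way_protocol` and `promise_case` are hypothetical.

```python
from collections import Counter

def run_one_way_protocol(sets):
    """Pass a mergeable summary from party 1 to party t, one hop at a time.

    An exact Counter stands in for the (F_p, F_0) sketches; this is an
    illustrative assumption, not a small-space structure.
    """
    summary = Counter()
    for party_set in sets:          # party r inserts its vector x_r
        summary.update(party_set)   # each element is one stream insertion
    return summary                  # state held by the t-th party

def promise_case(sets):
    """Classify a t-DISJ instance that satisfies the promise."""
    x = run_one_way_protocol(sets)
    t = len(sets)
    if all(count <= 1 for count in x.values()):
        return "disjoint"
    heavy = [i for i, count in x.items() if count == t]
    others_binary = all(c == 1 for i, c in x.items() if i not in heavy)
    if len(heavy) == 1 and others_binary:
        return ("common", heavy[0])
    raise ValueError("instance violates the t-DISJ promise")

disjoint_instance = [{1, 2}, {3}, {4, 5}]
common_instance = [{1, 7}, {2, 7}, {7, 9}]
```

The total communication is $t - 1$ messages of one summary each, which is why the total cost is bounded by $t$ times the space of the structures.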

We first show that the procedure InferDisj is correct with probability at least $9/10$. Define the event $\mathrm{GoodF}_0$ as $\hat{F}_0 \in (1 \pm \epsilon/10)\|x\|_0$, so $\mathrm{GoodF}_0$ holds with probability $19/20$. Let $x = x_1 + x_2 + \cdots + x_t$. Say that $i$ is a heavy item in $x$ if $x(i) = t$. Procedure InferDisj obtains an estimate $\hat{F}_p^{(i)}$ by applying the $F_p$ estimation algorithm to the vector $x + n^{1/p} e_i$ (in parallel, for each $i$). Given $x$ and an index $i$, we consider three cases. Assume $3p < n$.

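Case 1: $x$ has no heavy item, that is, $x$ is a binary vector. So $x + n^{1/p} e_i = x' + (n^{1/p} + x(i)) e_i$, where $x'$ is a binary vector with $x'(i) = 0$. Hence $\|x'\|_0 = \|x\|_0 - x(i)$ and

$$
\begin{aligned}
\|x + n^{1/p} e_i\|_p^p &= \|x'\|_0 + (n^{1/p} + x(i))^p \\
&\le \|x\|_0 + n e^{x(i) p / n^{1/p}} - x(i) \\
&\le \|x\|_0 + n\left(1 + \frac{5p}{4 n^{1/p}}\right), \quad \text{assuming } p < n \\
&\le \|x\|_0 + n(1 + \epsilon/64), \quad \text{since } 5p/(4 n^{1/p}) < \epsilon/64. \qquad (1)
\end{aligned}
$$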



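So with probability $1 - 1/(20n)$, and conditional on $\mathrm{GoodF}_0$, the estimate $\hat{F}_p^{(i)}$ for index $i$ satisfies

$$
\begin{aligned}
\hat{F}_p^{(i)} &\le (1 + \epsilon/10)\|x + n^{1/p} e_i\|_p^p \\
&\le (1 + \epsilon/10)(\|x\|_0 + n(1 + \epsilon/64)), \quad \text{from (1)} \\
&\le \hat{F}_0 + n(1 + \epsilon/8). \qquad (2)
\end{aligned}
$$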
procedure InferDisj

Input: $F_0$ and $F_p$ sketches of $x = x_1 + \ldots + x_t$ (an integer $n$-dimensional vector) such that

1. For $x$, one of two cases holds: Disjoint: $x \in \{0, 1\}^n$, or Common Element: there exists exactly one $i$ such that $x(i) = t = \lceil \epsilon n^{1/p}/(2p) \rceil$ and the remaining $x(j)$'s are either 0 or 1.

2. $\hat{F}_0 \in (1 \pm \epsilon)F_0$ with probability $19/20$, and $\hat{F}_p \in (1 \pm \epsilon)F_p$ with probability $1 - 1/(20n)$.

Output: Returns common element $i$ if the input is identified to fall in the Common Element case and the item with frequency $t$ is identified as $i$, and returns DISJOINT if the input is identified to fall in the Disjoint case.

1. $\hat{F}_0$ = estimate for $\|x\|_0$
2. for i := 1 to n in parallel do {
3.     insert $(i, n^{1/p})$ to the $F_p$ sketch
4.     obtain $\hat{F}_p$
5.     if $\hat{F}_p > \hat{F}_0 + n(1 + 2\epsilon/5)$ then
6.         return common element i
7. }
8. return DISJOINT

Figure 1: Solving Set Disjointness using $F_p$ and $F_0$ sketches
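To make the threshold test of Figure 1 concrete, here is a minimal sketch that runs it with exact values of $\|x\|_0$ and $\|x + n^{1/p} e_i\|_p^p$ in place of sketch estimates. Several things are assumptions made for illustration: $n^{1/p}$ is taken to be an integer (`n_root`), the vector is stored sparsely, and only a supplied list of candidate indices is tested rather than all of $[n]$. The parameters $p = 3$, $n = 10^9$, $\epsilon = 0.245$ are chosen to satisfy $80p/n^{1/p} < \epsilon < 1/4$.

```python
import math

def infer_disj(x, n_root, p, eps, candidates):
    """Threshold test of InferDisj with exact values in place of sketches.

    x          : dict index -> frequency (sparse form of x_1 + ... + x_t)
    n_root     : integer value of n^(1/p), so n = n_root ** p (assumption)
    candidates : indices i to test; Figure 1 loops over all of [n]
    """
    n = n_root ** p
    f0 = sum(1 for c in x.values() if c > 0)     # exact ||x||_0
    fp = sum(c ** p for c in x.values())         # exact ||x||_p^p
    threshold = f0 + n * (1 + 2 * eps / 5)
    for i in candidates:
        xi = x.get(i, 0)
        # Inserting n^(1/p) copies of i changes only coordinate i.
        fp_i = fp - xi ** p + (xi + n_root) ** p
        if fp_i > threshold:
            return i
    return "DISJOINT"

p, n_root, eps = 3, 1000, 0.245                  # n = 10**9
t = math.ceil(eps * n_root / (2 * p))            # number of parties
disjoint_x = {i: 1 for i in range(1, 201)}       # binary vector
common_x = dict(disjoint_x)
common_x[77] = t                                 # unique heavy item
```

With exact values the gap between the two cases is visible directly: adding $n^{1/p}$ to the heavy coordinate yields roughly $n(1 + \epsilon/2)$, while any other coordinate yields roughly $n$.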

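Case 2: $x$ has a (unique) heavy item whose index is $j \neq i$. Then $x + n^{1/p} e_i = x' + t e_j + (n^{1/p} + x(i)) e_i$, where $x'(i) = x'(j) = 0$. Hence $\|x'\|_0 = \|x\|_0 - 1 - x(i)$ and

$$
\begin{aligned}
\|x + n^{1/p} e_i\|_p^p &= \|x\|_0 - 1 - x(i) + t^p + (n^{1/p} + x(i))^p \\
&\le \|x\|_0 - 1 + \frac{\epsilon^p n}{(2p)^p} + n e^{p/n^{1/p}} - x(i) \\
&\le \|x\|_0 - 1 + \frac{\epsilon n}{4^{2p-1}} + n\left(1 + \frac{5p}{4 n^{1/p}}\right) - x(i), \quad \text{as in (1)} \\
&\le \|x\|_0 + n\left(1 + \frac{\epsilon}{64} + \frac{\epsilon}{64}\right) \qquad (3)
\end{aligned}
$$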

In the second to last step, we use $\epsilon^p < \epsilon (1/4)^{p-1}$ and $(2p)^p > 4^p$ since $p > 2$. In the last step, we make use of the assumption that $\frac{5p}{4 n^{1/p}} < \frac{\epsilon}{64}$. Hence, with probability $1 - 1/(20n)$, and conditional on $\mathrm{GoodF}_0$,

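$$
\begin{aligned}
\hat{F}_p^{(i)} &\le (1 + \epsilon/10)\|x + n^{1/p} e_i\|_p^p \\
&\le (1 + \epsilon/10)(\|x\|_0 + n(1 + \epsilon/32)), \quad \text{from (3)} \\
&\le (1 + \epsilon/10)\|x\|_0 + n(1 + 2\epsilon/17), \quad \text{using } \epsilon < 1/4 \\
&\le \hat{F}_0 + n(1 + \epsilon/7) \qquad (4)
\end{aligned}
$$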

Case 3: $x$ has a (unique) heavy item with index $i$. Then $x + n^{1/p} e_i = x' + (n^{1/p} + \epsilon n^{1/p}/(2p)) e_i$, where $x'$ is a binary vector with $x'(i) = 0$. Hence $\|x'\|_0 = \|x\|_0 - 1$ and

$$
\|x + n^{1/p} e_i\|_p^p = \|x'\|_0 + (n^{1/p} + t)^p = \|x\|_0 - 1 + n\left(1 + \frac{\epsilon}{2p}\right)^p \ge \|x\|_0 + n(1 + \epsilon/2) \qquad (5)
$$
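The last step uses a two-term Taylor expansion of $(1 + a)^p$ around $a = 0$ to obtain $(1 + a)^p \ge 1 + pa + p(p-1)a^2/2$. Setting $a = \epsilon/(2p)$, we get $(1 + a)^p \ge 1 + \frac{\epsilon}{2} + \frac{\epsilon^2}{8} - \frac{\epsilon^2}{8p} \ge 1 + \frac{\epsilon}{2} + \frac{\epsilon^2}{16}$, since $p > 2$. So $n\left(1 + \frac{\epsilon}{2p}\right)^p - 1 \ge n\left(1 + \frac{\epsilon}{2}\right)$, since $\epsilon > 3/\sqrt{n}$.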
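The chain of inequalities in the Taylor step can be checked with exact rational arithmetic. The sketch below is only a sanity check for integer $p$ and rational $\epsilon$ (an illustrative restriction); the function name `case3_chain` is hypothetical.

```python
from fractions import Fraction

def case3_chain(p, eps):
    """Return the four quantities in the Taylor step, computed exactly.

    p is an integer >= 2 and eps is a Fraction; a = eps / (2p).
    """
    a = eps / (2 * p)
    exact = (1 + a) ** p                                    # (1 + a)^p
    taylor = 1 + p * a + Fraction(p * (p - 1), 2) * a * a   # two-term bound
    # Substituting a = eps/(2p): p*a = eps/2 and
    # p(p-1)a^2/2 = eps^2/8 - eps^2/(8p).
    simplified = 1 + eps / 2 + eps ** 2 / 8 - eps ** 2 / (8 * p)
    weakest = 1 + eps / 2 + eps ** 2 / 16                   # uses p >= 2
    return exact, taylor, simplified, weakest

def chain_holds(p, eps):
    exact, taylor, simplified, weakest = case3_chain(p, eps)
    return exact >= taylor and taylor == simplified and simplified >= weakest
```

For example, `chain_holds(3, Fraction(1, 10))` evaluates the chain for $p = 3$ and $\epsilon = 1/10$ without any floating-point rounding.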

The procedure InferDisj estimates $\|x + n^{1/p} e_i\|_p^p$ using the assumed $F_p$ estimation procedure. So with probability $1 - 1/(20n)$, and conditional on $\mathrm{GoodF}_0$,

$$
\hat{F}_p^{(i)} \ge (1 - \epsilon/10)\|x + n^{1/p} e_i\|_p^p \ge (1 - \epsilon/10)(\|x\|_0 + n(1 + \epsilon/2)) > \hat{F}_0 + n(1 + 2\epsilon/5). \qquad (6)
$$

Define the event $\mathrm{GoodF}_p$ to hold if $\hat{F}_p^{(i)} \in (1 \pm \epsilon/10)\|x + n^{1/p} e_i\|_p^p$ for each $i \in [n]$. By the union bound, $\mathrm{GoodF}_p$ holds with probability $19/20$. Similarly, $\mathrm{GoodF}_0$ holds with probability $19/20$. Hence, both $\mathrm{GoodF}_p$ and $\mathrm{GoodF}_0$ hold simultaneously with probability $1 - 2/20 + 1/400 > 9/10$. Assume $\mathrm{GoodF}_p$ and $\mathrm{GoodF}_0$ hold. Then procedure InferDisj correctly distinguishes Case 3 from Cases 1 and 2, since for Case 3, $\hat{F}_p^{(i)} > \hat{F}_0 + n(1 + 2\epsilon/5)$, and for Cases 1 and 2, $\hat{F}_p^{(i)} \le \hat{F}_0 + n(1 + \epsilon/7)$. Case 3 corresponds to the common element case when the common element is $i$. Case 2 corresponds to the common element case when the common element is not $i$. Finally, Case 1 corresponds to the pairwise disjoint case. (Cases 1 and 2 cannot be distinguished by the algorithm.) So, assuming $\mathrm{GoodF}_0$ and $\mathrm{GoodF}_p$, the check $\hat{F}_p^{(i)} > \hat{F}_0 + n(1 + 2\epsilon/5)$, run in parallel for each $i$, succeeds only when the sets have $i$ as the common element. The check fails for all other values of $i$. If the sets are pairwise disjoint, the check fails for every $i$. Since both $\mathrm{GoodF}_0$ and $\mathrm{GoodF}_p$ hold with probability at least $9/10$, it follows that procedure InferDisj solves $t$-DISJ with probability at least $9/10$.

By the work of [2, 3], any protocol for solving $t$-DISJ requires a total communication of $\Omega(n/t)$ bits. Let $S(\epsilon)$ be the total space used by the protocol proposed above. Then,

$$
S(\epsilon)\, t = \Omega(n/t), \quad \text{or,} \quad S(\epsilon) = \Omega(n/t^2) = \Omega(p^2 n^{1-2/p}/\epsilon^2)
$$
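The last equality follows by substituting $t = \lceil \epsilon n^{1/p}/(2p) \rceil$, which gives $n/t^2 \approx 4p^2 n^{1-2/p}/\epsilon^2$. The following sketch evaluates both sides numerically for one parameter setting; it assumes $n^{1/p}$ is an integer (`n_root`) so that $n^{1-2/p} = $ `n_root ** (p - 2)`, and the function names are hypothetical.

```python
import math

def party_count(n_root, p, eps):
    """t = ceil(eps * n^(1/p) / (2p)), with n^(1/p) given as n_root."""
    return math.ceil(eps * n_root / (2 * p))

def per_party_bound(n_root, p, eps):
    """Return (n / t^2, 4 p^2 n^(1-2/p) / eps^2) for comparison.

    Constants hidden in the Omega notation are ignored here.
    """
    n = n_root ** p
    t = party_count(n_root, p, eps)
    from_t = n / t ** 2
    closed_form = 4 * p ** 2 * n_root ** (p - 2) / eps ** 2
    return from_t, closed_form

from_t, closed_form = per_party_bound(1000, 3, 0.245)
```

Because of the ceiling in $t$, the value $n/t^2$ is at most the closed form and approaches it as $\epsilon n^{1/p}/(2p)$ grows.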

The space $S(\epsilon) = S_0(\epsilon) + S_p(\epsilon)$, where $S_0(\epsilon)$ is the space required for a $(1 \pm \epsilon/10)$-approximation of $F_0$ with high constant confidence and $S_p(\epsilon)$ is the space required for a $(1 \pm \epsilon/10)$-approximation of $F_p$ with confidence $1 - 1/(20n)$. Since the above protocol does not involve deletions, from [4], $S_0(\epsilon) = O(\epsilon^{-2} + \log n)$ bits. Hence, $S(\epsilon) = S_p(\epsilon) + S_0(\epsilon) \ge \Omega(p^2 n^{1-2/p}\epsilon^{-2})$, or,

$$
\begin{aligned}
S_p(\epsilon) &\ge \Omega(p^2 n^{1-2/p}\epsilon^{-2}) - S_0(\epsilon) \\
&= \Omega(p^2 n^{1-2/p}\epsilon^{-2}) - O(\epsilon^{-2} + \log n) \\
&= \Omega(p^2 n^{1-2/p}\epsilon^{-2} - \log(n))
\end{aligned}
$$

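Since $S_p(\epsilon)$ is the space used for estimating $F_p$ to within $1 \pm \epsilon/10$ with confidence $1 - 1/(20n)$, it follows that the space required for estimating $F_p$ to within $1 \pm \epsilon/10$ with confidence $1 - 1/20$ is lower bounded by

$$
\Omega\left(\frac{S_p(\epsilon)}{\log(n)}\right) = \Omega\left(\frac{p^2 n^{1-2/p}}{\epsilon^2 \log(n)} - 1\right) = \Omega\left(\frac{p^2 n^{1-2/p}}{\epsilon^2 \log(n)}\right)
$$

where the last equality follows since there is an $\Omega(\epsilon^{-2} + \log(n))$ bound for the problem.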
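The text does not spell out why the confidence change costs a factor of $\log(n)$. One standard way to see it, stated here as an assumption rather than as the author's argument, is median amplification: running $O(\log n)$ independent copies of a confidence-$19/20$ estimator and returning the median boosts the confidence to $1 - 1/(20n)$, so a structure with confidence $19/20$ cannot be more than a $\log(n)$ factor smaller than $S_p(\epsilon)$. A toy illustration of the median step, with a deterministic stand-in estimator and hypothetical names:

```python
import itertools
import statistics

def amplify(estimator, repetitions):
    """Median of repeated runs of a constant-confidence estimator."""
    return statistics.median(estimator() for _ in range(repetitions))

# Toy estimator: the correct value is 100, but every fourth call returns
# a wild outlier. A real estimator would fail at random, not on a cycle.
_outputs = itertools.cycle([100, 100, 100, 10**6])

def toy_estimator():
    return next(_outputs)

result = amplify(toy_estimator, 9)
```

The median is unaffected as long as fewer than half of the runs fail, which is what makes the failure probability drop exponentially in the number of repetitions.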